Read from multiple datasets at once

Hello,
I wanted to ask if there is a functionality in h5pyd to retrieve data from multiple datasets/groups in one call.
Context:
I am asking for performance reasons, especially for reading fairly small subsets. My data forces me to split it across different datasets, but I usually need to retrieve data from all of them.

So far I haven’t found anything in the documentation except an old design document:

Has this been implemented?

Also, I have tried to access different domains asynchronously with the Python asyncio library, which gave me the same runtime as accessing them sequentially. Maybe I made a mistake, or does HSDS not support parallel requests from one client?

I appreciate your help. :slight_smile:
Leonard

Hi, thanks for your question!

Yes, if you need to retrieve data from many small datasets it can be a bit slow, since the latency of each request to HSDS adds up.

When you experimented with asyncio, were you using the aiohttp package? Unless your HTTP routines specifically support await, the calls are likely to be made sequentially anyway.

Another practical issue with asyncio is that unless your application is already designed with async processing in mind, it’s hard to bolt on some async functions later on.
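
To illustrate the first point, here's a minimal sketch (not h5pyd internals): because each h5pyd read issues a blocking HTTP request under the hood, wrapping it in a coroutine still stalls the event loop, and asyncio.gather ends up running the reads one after another:

import asyncio

async def read_one(ds, index):
    # ds[index] performs a blocking HTTP request internally, so the event
    # loop stalls here and the "concurrent" tasks actually run in sequence
    return ds[index]

async def read_all(datasets, indices):
    tasks = [read_one(ds, idx) for ds, idx in zip(datasets, indices)]
    return await asyncio.gather(*tasks)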

Anyway, to provide a more practical way for Python users to benefit from parallel processing, we recently added an h5pyd feature to help with this use case: MultiManager. The MultiManager enables applications to read or write multiple selections across multiple datasets in one call. Internally, it uses Python threading to send one HTTP request per selection in parallel.
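
For reference, basic usage looks something like this (a sketch based on the test code; the domain path and dataset names are made up):

import h5pyd

# open a domain and pick the datasets to read from (hypothetical names)
f = h5pyd.File("/home/myuser/weather.h5", "r")
datasets = [f["temperature"], f["pressure"], f["humidity"]]
mm = h5pyd.MultiManager(datasets)

# one selection per dataset (here each dataset is assumed 2-D); each
# selection is fetched via its own HTTP request, sent in parallel
results = mm[[(0, 0), (0, 0), (0, 0)]]  # list of arrays, one per dataset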

The code is not yet in an official h5pyd release, but you can get it with:

$ pip install git+https://github.com/hdfgroup/h5pyd

Take a look at some of the test code from:

and it should be fairly clear how it works. If anything is unclear, please let us know.

Once you’ve tried out MultiManager, I’d be curious to hear what kind of performance benefit you see. In our testing speedup varies quite a bit depending on the number of selections used, the size of the selections, and a host of other factors. Hopefully your application will get a good speedup!


Hi jreadey,
I am happy this feature has already been implemented in h5pyd. With the help of the test code I was able to write a small benchmark script which compares accessing the data with the MultiManager versus sequentially. I tested it on what is effectively a netCDF file with one 5-dimensional variable and the 5 corresponding coordinate axes, so 6 datasets in total, one of them much larger than the others. I benchmarked by retrieving one random entry from each of the datasets in order to avoid caching effects.

Time for sequential access: ~400 ms
Time with MultiManager: ~100 ms

That is a performance improvement of a factor of 4, compared to a theoretical limit of 6 (one parallel request per dataset).

I would assume the large dataset incurs some lookup overhead, so retrieving a value from it takes longer than from the smaller ones. If that is the case, accessing equally sized datasets should result in better scaling for the MultiManager.

I wrote some generic benchmark code, feel free to use it:

import random
from time import time

import h5pyd


def generate_range(ds_shape: tuple):
    # generate a tuple of random indices for one dataset
    indices = []
    for axis_length in ds_shape:
        index = random.randint(0, axis_length - 1)
        indices.append(index)
    return tuple(indices)


def generate_index_query(h5file):
    # generate a list of index tuples
    query = []
    for ds in h5file.values():
        ds_shape = ds.shape
        indices = generate_range(ds_shape)
        query.append(indices)
    return query


def benchmark_multimanager(h5file, num=10):
    """
    Benchmark retrieving one random entry from every dataset in an h5file 
    using the MultiManager.
    """
    ds_names = list(h5file.keys())
    datasets = [h5file[name] for name in ds_names]
    mn = h5pyd.MultiManager(datasets)

    # prepare queries to exclude from runtime
    queries = []
    for i in range(num):
        query = generate_index_query(h5file)
        queries.append(query)

    # accessing the data
    t0 = time()
    for query in queries:
        results = mn[query]

    runtime = time() - t0
    print(f"Mean runtime multimanager: {runtime / num:.4f} s")
    # 100ms for case with 6 datasets


def benchmark_sequential_ds(h5file, num=10):
    """
    Benchmark retrieving one random entry from every dataset in 
    an h5file by sequentially looping through the datasets
    """
    # prepare queries to exclude this code from runtime
    index_lists = []
    for i in range(num):
        index_list = []
        for ds in h5file.values():
            indices = generate_range(ds.shape)
            index_list.append(indices)
        index_lists.append(index_list)

    # accessing the data
    t0 = time()
    for index_list in index_lists:
        for indices, ds in zip(index_list, h5file.values()):
            result = ds[indices]

    runtime = time() - t0
    print(f"Mean runtime sequentially: {runtime / num:.4f} s")
    # ~400 ms for the case with 6 datasets
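
To run the two benchmarks, open a domain and pass it to both functions (the domain path here is just an example):

import h5pyd

with h5pyd.File("/home/leonard/weather.h5", "r") as f:
    benchmark_sequential_ds(f)
    benchmark_multimanager(f)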

Will the MultiManager be added to the next release?


Happy to hear that the MultiManager worked so well for you!
Yes, it will be in the next h5pyd release (I might add your benchmark script as well).

I’ve checked in Leo’s benchmark test to h5pyd here: h5pyd/examples/multi_mgr_benchmark.py at master · HDFGroup/h5pyd · GitHub.

Got the following results on AWS:


$ python multi_mgr_benchmark.py
Mean runtime sequentially: 3.7388 s
Mean runtime multimanager: 0.4490 s

More than an 8x speedup! (YMMV.)

Also added a notebook example here: h5pyd/examples/notebooks/multi_manager_example.ipynb at master · HDFGroup/h5pyd · GitHub


You tested with a local file, right?
If so, I am quite impressed that there was so much to gain even locally. For remote access I would assume it scales even better.

No, this was testing against S3 (and re-starting HSDS to negate any caching effects).
There was significant speedup with local data as well.
